pyPLNmodels: analyze multivariate count data

StatOmique working group

Bastien Batardière

MIA Paris-Saclay, INRAE

Joon Kwon

MIA Paris-Saclay, INRAE

Mahendra Mariadassou

MaIAGE, INRAE

Julien Chiquet

MIA Paris-Saclay, INRAE

March 10, 2025

Overview

  • Data: single cell RNA-seq data
  • Introduction: PLN models
  • Run some examples of \(\texttt{pyPLNmodels}\)
  • Focus on ZIPln and PlnAR

Motivation: multivariate count data

Single cell RNA-seq data.

Data tables

  • \(\texttt{endog}\): \(\mathbf Y\), endogenous count variables (read counts of transcripts \(j\) in sample \(i\))
  • \(\texttt{exog}\): \(\mathbf X\), exogenous variables (available covariates)
  • \(\texttt{offsets}\) \(\mathbf O\), sampling effort for transcripts \(j\) in sample \(i\)

Goals

  • understand covariates (regression, classification)
  • exhibit patterns (clustering, dimension reduction)
  • understand between-genes interactions (covariance analysis)
  • explain zero inflation

Illustration: scMARK dataset (Swechha et al. 2021)

from pyPLNmodels import load_scrna
rna = load_scrna(n_samples = 2000, dim = 3000)
rna.keys()
Returning scRNA dataset of size (2000, 3000)
dict_keys(['endog', 'labels', 'labels_1hot'])

Data description

Summary of the count matrix (pandas `describe()` output, truncated):

|       | FTL  | MALAT1 | FTH1 | TMSB4X | CD74 | SPP1 | APOE | B2M | ACTB | HLA-DRA | ... | PLEKHF1 |
|-------|------|--------|------|--------|------|------|------|-----|------|---------|-----|---------|
| count | 2000 | 2000   | 2000 | 2000   | 2000 | 2000 | 2000 | 2000| 2000 | 2000    | ... | 2000    |
| mean  | 62   | 119    | 36   | 65     | 34   | 10   | 13   | 55  | 33   | 22      | ... | 0       |
| std   | 224  | 112    | 95   | 92     | 88   | 69   | 67   | 62  | 61   | 60      | ... | 0       |
| min   | 0    | 0      | 0    | 0      | 0    | 0    | 0    | 0   | 0    | 0       | ... | 0       |
| 25%   | 2    | 35     | 2    | 15     | 1    | 0    | 0    | 20  | 5    | 0       | ... | 0       |
| 50%   | 5    | 94     | 7    | 36     | 3    | 0    | 0    | 41  | 14   | 1       | ... | 0       |
| 75%   | 17   | 169    | 22   | 74     | 16   | 0    | 1    | 66  | 36   | 7       | ... | 0       |
| max   | 4129 | 1171   | 2006 | 1179   | 1010 | 1792 | 1197 | 771 | 1463 | 500     | ... | 14      |

8 rows × 3000 columns

Data description

T_cells_CD4+    734
T_cells_CD8+    720
Macrophages     546
Name: standard_true_celltype_v5, dtype: int64


Background on the Poisson-lognormal Family: Model and Inference

The Poisson Log Normal model

PLN (Aitchison and Ho 1989) is a multivariate generalized linear model, where

  • the counts \(\mathbf{Y}_i\in\mathbb{N}^p\) are the response variables
  • the main effect is due to a linear combination of the covariates \(\mathbf{x}_i\in\mathbb{R}^d\)
  • a vector of offsets \(\mathbf{o}_i\in\mathbb{R}^p\) can be specified

\[ \quad \mathbf{Y}_i | \mathbf{Z}_i \sim^{\text{iid}} \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}({\underbrace{\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}}}_{{\boldsymbol\mu}_i},\boldsymbol\Sigma). \]

Typically, \(n\) and \(p\) are in the tens of thousands, with \(d \leq 10\) covariates.

The latent Gaussian layer builds in overdispersion: \[\mathbb V[Y_{ij}] > \mathbb E[Y_{ij}] \]
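This overdispersion is easy to check by simulation. The NumPy sketch below (with illustrative values for \(\mu\) and \(\sigma\)) draws one-dimensional Poisson-lognormal counts and compares the empirical moments to the closed-form ones, \(\mathbb E[Y] = e^{\mu + \sigma^2/2}\) and \(\mathbb V[Y] = \mathbb E[Y] + \mathbb E[Y]^2(e^{\sigma^2} - 1)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mu, sigma = 1.0, 1.0  # illustrative latent mean and standard deviation

# PLN draw: Z ~ N(mu, sigma^2), then Y | Z ~ Poisson(exp(Z))
z = rng.normal(mu, sigma, size=n)
y = rng.poisson(np.exp(z))

# Closed-form PLN moments
mean_th = np.exp(mu + sigma**2 / 2)
var_th = mean_th + mean_th**2 * (np.exp(sigma**2) - 1)

print(y.mean(), mean_th)   # empirical mean close to exp(mu + sigma^2/2)
print(y.var(), var_th)     # variance much larger than the mean
print(y.var() > y.mean())  # overdispersion: True
```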

Why use PLN models?

Suppose we aim to predict the cell type based on the scMARK dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def get_classif_error(data, y):
    data_train, data_test, y_train, y_test = train_test_split(data, y, test_size=0.33, random_state=42)
    lda = LDA()
    lda.fit(data_train, y_train)
    y_pred = lda.predict(data_test)
    return np.mean(y_pred != y_test)
rna = load_scrna(n_samples = 1000)
print('Classif error:',get_classif_error(rna["endog"], rna["labels"]))
Returning scRNA dataset of size (1000, 100)
Classif error: 0.31212121212121213
from pyPLNmodels import Pln
latent_variables = Pln(rna["endog"]).fit().latent_variables
print("Classif error: ", get_classif_error(latent_variables, rna["labels"]))
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 3.3 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Classif error:  0.17575757575757575

Qualitative assessment

Natural extensions of PLN

Various tasks of multivariate analysis

  • Dimension Reduction: rank constraint on the matrix \(\boldsymbol\Sigma\) (Chiquet, Mariadassou, and Robin 2018) \[\mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top), \quad \mathbf{C} \in \mathcal{M}_{pk} \text{ with orthogonal columns}.\]

  • Network inference: sparsity constraint on the inverse covariance (Chiquet, Mariadassou, and Robin 2019) \[\mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_i, \boldsymbol\Sigma = \boldsymbol\Omega^{-1}), \quad \|\boldsymbol\Omega \|_1 < c.\]
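The rank constraint can be illustrated without the package. In the NumPy sketch below (with an arbitrary loading matrix \(\mathbf C\); orthogonality of its columns is not enforced here), latent positions drawn as \(\mathbf Z_i = \mathbf C \mathbf u_i\) with \(\mathbf u_i \sim \mathcal N(0, \mathbf I_k)\) have covariance \(\mathbf C \mathbf C^\top\) of rank \(k \ll p\):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, n = 10, 2, 50_000

C = rng.normal(size=(p, k))  # loadings: p x k
U = rng.normal(size=(n, k))  # k-dimensional latent scores u_i ~ N(0, I_k)
Z = U @ C.T                  # Z_i ~ N(0, C C^T): covariance of rank k

print(np.linalg.matrix_rank(C @ C.T))        # 2: the constrained rank
print(np.abs(np.cov(Z.T) - C @ C.T).max())   # small: empirical covariance matches
```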

Various tasks of multivariate analysis: Classification

  • Unsupervised: mixture model in the latent space (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{Z}_i \mid c_i = k \sim \mathcal{N}(\boldsymbol\mu_k, \boldsymbol\Sigma_k), \quad \text{for unknown memberships } c_i.\]

  • Supervised: maximize separation between group means (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{Z}_i \sim \mathcal{N}(\sum_k {\boldsymbol\mu}_k \mathbf{1}_{\{c_i= k\}}, \boldsymbol\Sigma), \quad \text{for known memberships } c_i.\]

Various tasks of multivariate analysis: Zero-inflation

  • Zero-inflation: add a zero-inflation latent variable (Chiquet, Mariadassou, and Robin 2021) \[\mathbf{W}_i \sim \mathcal{B}(\sigma(\boldsymbol{\mu}^{0}_i)), \quad \mathbf Y_i \sim (1 - \mathbf W_i)\odot \mathcal P(\exp(\mathbf Z_i))\]

  • Dimension reduction and Zero Inflation: Add a rank constraint on \(\boldsymbol \Sigma\)
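A minimal simulation (plain NumPy, with illustrative parameter values) shows what the zero-inflation layer buys: the fraction of zeros far exceeds what a Poisson fit with the same mean would predict:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
pi, mu, sigma = 0.4, 1.0, 0.5  # illustrative zero-inflation and latent parameters

z = rng.normal(mu, sigma, size=n)
w = rng.random(n) < pi                       # W ~ Bernoulli(pi): structural zeros
y = np.where(w, 0, rng.poisson(np.exp(z)))   # Y = (1 - W) * Poisson(exp(Z))

# A Poisson with the same mean predicts far fewer zeros
print((y == 0).mean())    # roughly pi plus the Poisson sampling zeros
print(np.exp(-y.mean()))  # zero rate of a mean-matched Poisson: much smaller
```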

Various tasks of multivariate analysis: autoregressive models

  • Autoregressive: Assume dependency between individuals \[\mathbf{Z}_{i+1} = \Phi \mathbf Z_i + \epsilon_i \]

Estimation

Estimation

Estimate \(\theta = (\mathbf{B}, \boldsymbol\Sigma)\) and predict the \(\mathbf{Z}_i\), while the marginal likelihood of the model is

\[\begin{equation*} p_\theta(\mathbf{Y}_i) = \int_{\mathbb{R}^p} \prod_{j=1}^p p_\theta(Y_{ij} | Z_{ij}) \, p_\theta(\mathbf{Z}_i) \mathrm{d}\mathbf{Z}_i \end{equation*}\]

  • Intractable likelihood

  • EM requires evaluating (some moments of) \(p_\theta(\mathbf{Z} \,|\, \mathbf{Y})\), but there is no closed form!

  • \(\implies\) variational EM

Variational Inference

Use a proxy \(q_\psi\) of \(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\) minimizing a divergence over a class \(\mathcal{Q}\) (e.g., the Kullback–Leibler divergence)

\[\begin{equation*} q_\psi(\mathbf{Z})^\star \in \arg\min_{q\in\mathcal{Q}} KL\left(q(\mathbf{Z}), p_{\theta}(\mathbf{Z} | \mathbf{Y})\right). \end{equation*}\]

and maximize the ELBO (Evidence Lower BOund)

\[\begin{equation*} J(\theta, \psi) = \mathbb{E}_{\psi} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q_\psi(\mathbf{Z})] \end{equation*}\]
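For intuition, the ELBO has a closed form in the one-dimensional case (no covariates, no offset). The NumPy sketch below (illustrative parameter values) evaluates it for a Gaussian \(q(Z) = \mathcal N(m, s^2)\) and checks that it indeed lower-bounds the marginal log-likelihood, computed here by numerical integration:

```python
import numpy as np
from math import lgamma

def elbo(y, mu, sig2, m, s2):
    """ELBO for one PLN count y with variational q(Z) = N(m, s2)."""
    e_loglik = y * m - np.exp(m + s2 / 2) - lgamma(y + 1)  # E_q[log p(y|Z)]
    e_prior = -0.5 * np.log(2 * np.pi * sig2) - (s2 + (m - mu) ** 2) / (2 * sig2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)          # H[q]
    return e_loglik + e_prior + entropy

def marginal_loglik(y, mu, sig2):
    """log p(y) by numerical integration over the latent Z."""
    z = np.linspace(mu - 10, mu + 10, 20_001)
    log_joint = (y * z - np.exp(z) - lgamma(y + 1)
                 - 0.5 * np.log(2 * np.pi * sig2) - (z - mu) ** 2 / (2 * sig2))
    return np.log(np.sum(np.exp(log_joint)) * (z[1] - z[0]))

y, mu, sig2 = 3, 1.0, 0.5
# The bound holds for any choice of (m, s2), here an arbitrary one
print(elbo(y, mu, sig2, m=1.0, s2=0.2) <= marginal_loglik(y, mu, sig2))  # True
```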

Optimisation methods

  • Standard approach: alternating maximization of the ELBO (VE-step and M-step).

  • Brute-force approach:

\[(\theta^{(t+1)}, \psi^{(t+1)}) = (\theta^{(t)}, \psi^{(t)}) + \eta_t \nabla_{\theta, \psi} J(\theta^{(t)}, \psi^{(t)})\]

Estimator properties

  • Variance is available for the regression coefficient of PLN model (only) \(\implies\) confidence intervals and hypothesis testing.

  • Estimator may be biased for some models

\(\texttt{pyPLNmodels}\)

About PLNmodels

  • \(\texttt{pyPLNmodels}\) is the Python counterpart of the R package \(\texttt{PLNmodels}\) (Chiquet, Mariadassou, and Robin 2021).
  • Built with the hindsight of the R implementation.
  • Emphasis is placed on optimization: simple and efficient.
  • Implements the \(\texttt{PlnAR}\) and \(\texttt{ZIPlnPCA}\) models.

How to fit a model

Like R-formula

from pyPLNmodels import Pln
pln = Pln.from_formula("endog ~ labels ", data = rna)
_ = pln.fit()
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 2.8 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

Explicitly

from pyPLNmodels import Pln
pln = Pln(rna["endog"], exog = rna["labels_1hot"], add_const = False)
_ = pln.fit()
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 2.4 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

Attributes and method

Available when calling \(\texttt{print(model)}\):

print(pln)
A multivariate Pln with full covariance.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC
   -226568.5         100        5350 245046.7454    231918.5

======================================================================
* Useful attributes
    .latent_variables .latent_positions .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
    None
* Additional methods for Pln are:
    .get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()

Insights

pln.show(figsize = (10, 8))

Insights

pln.plot_regression_forest(figsize = (11,6))

  • The \(\texttt{PlnNetwork}\) model infers which genes are conditionally dependent by imposing a sparsity constraint on the precision matrix:

  • \[\| \boldsymbol\Sigma^{-1} \|_1 \leq \text{penalty}\]

  • A zero entry \(\Sigma^{-1}_{jk} = 0\) means \(Z_j\) and \(Z_k\) are independent given all the other variables.

  • The ELBO remains differentiable.

import matplotlib.pyplot as plt
from pyPLNmodels import PlnNetwork
net = PlnNetwork(rna["endog"], penalty = 2000, compute_offsets_method = "logsum").fit()
fig, ax = plt.subplots(1, 1, figsize = (12, 5))
net.viz_network(ax=ax)
Setting the offsets as the log of the sum of `endog`.
Fitting a PlnNetwork model with with penalty 2000
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 2.9 seconds.
Last  criterion = 5.34e-06 . Required tolerance = 1e-06

Predict newly observed data

The \(\texttt{PlnLDA}\) model predicts the labels of newly observed data by performing LDA in the latent space.

Predict newly observed data

from pyPLNmodels import PlnLDA, plot_confusion_matrix
ntrain = 500
endog_train, endog_test = rna["endog"][:ntrain],rna["endog"][ntrain:]
labels_train, labels_test = rna["labels"][:ntrain], rna["labels"][ntrain:]
lda = PlnLDA(endog_train, clusters = labels_train).fit()
pred_test = lda.predict_clusters(endog_test)
Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 1.9 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Doing a VE step.
Done!
Setting the offsets to zero.
Doing a VE step.
Done!
Setting the offsets to zero.
Doing a VE step.
Done!
plot_confusion_matrix(pred_test, labels_test)

from pyPLNmodels import ZIPln
zi = ZIPln(rna["endog"], exog = rna["labels_1hot"], add_const = False).fit()
Setting the offsets to zero.
Fitting a ZIPln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 6.5 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

Zero Inflated PLN: \(\texttt{ZIPln}\)

Zero Inflated PLN

  • Use two latent vectors \(\mathbf{W}_i\) and \(\mathbf{Z}_i\) to model excess of zeroes and dependence structure \[\begin{array}{rrl} \text{PLN latent space} & \boldsymbol{Z}_i = (Z_{ij})_{j=1\dots p} & \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{B}, \mathbf{\Sigma}), \\[1.5ex] \text{excess of zeros} & \boldsymbol{W}_{i} = (W_{ij})_{j=1\dots p} & \sim \otimes_{j=1}^p \cal B(\pi_{ij}), \\[1.5ex] \text{observation space} & Y_{ij} \, | \, W_{ij}, Z_{ij} & \sim (1-W_{ij})\mathcal{P} \left(\exp\{o_{ij} + Z_{ij}\}\right),\\ \end{array}\]

  • \(\rightsquigarrow\) Better handling of zeros + additional interpretable parameters.

  • Overdispersion remains: \[\mathbb V[Y_{ij}] > \mathbb E[Y_{ij}]\]

Zero-inflation modelling

\[ \pi_{ij} = \sigma( \boldsymbol X^{0}\boldsymbol B^0)_{ij}, ~ \boldsymbol X^0 \in \mathbb R^{n \times d_0}, ~ \boldsymbol B^0 \in \mathbb R^{d_0\times p}\]

  • Other modelling possibilities (one for all, one per dimension, etc. )

Identifiability

Identifiability of the model is ensured if the covariate matrix \(\mathbf X^0\) has full rank.

Distribution of \(W_{ij}| Y_{ij}\)

\[W_{ij} | Y_{ij} \sim \mathcal{B}\left(\frac{\pi_{ij}}{ \varphi\left(\mathbf{x}_i^\top \boldsymbol B_j, \Sigma_{jj}\right) \left(1 - \pi_{ij}\right) + \pi_{ij}}\right) \boldsymbol 1_{\{Y_{ij} = 0\}}\]

with \(\varphi(\mu,\sigma^2) = \mathbb E \left[ \exp(-X)\right], ~ X \sim \mathcal L \mathcal N \left( \mu, \sigma^2\right)\).
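The function \(\varphi\) has no closed form but is easy to approximate. The NumPy sketch below (illustrative parameter values) estimates it by Monte Carlo and evaluates the posterior probability that an observed zero is structural rather than a Poisson sampling zero:

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(mu, sigma2, n_mc=200_000):
    """Monte Carlo estimate of E[exp(-X)] for X ~ LogNormal(mu, sigma2)."""
    x = np.exp(rng.normal(mu, np.sqrt(sigma2), size=n_mc))
    return np.exp(-x).mean()

pi, mu, sigma2 = 0.3, 0.5, 1.0  # illustrative parameters
p_phi = phi(mu, sigma2)

# P(W = 1 | Y = 0): probability that an observed zero is structural
post = pi / (p_phi * (1 - pi) + pi)
print(p_phi, post)  # post > pi: conditioning on Y = 0 raises the probability
```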

Motivation of \(\texttt{ZIPln}\): microcosm dataset

  • Data: sequencing of the microbiota of 45 dairy cows
  • from 4 sites:
    • vagina
    • mouth
    • nose
    • milk
  • at 4 time points:
    • 1 week before calving
    • 1 month after
    • 3 months after
    • 7 months after

Data description

  • After data preprocessing \(\implies p = 1209\) Amplicon Sequence Variants and \(96\%\) zeros.

\(\texttt{Pln}\) versus \(\texttt{ZIPln}\)

zi = ZIPln.from_formula("endog ~ site + time", micro).fit()
pln = Pln.from_formula("endog ~ site + time", micro).fit()
`exog_inflation` and `exog` are set to the same array. If you need different `exog_inflation`, specify it with a pipe: '|' like in the following: endog ~ 1 + x | x + y 
Setting the offsets to zero.
Now dataset of size torch.Size([300, 492]).
Fitting a ZIPln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 15.2 seconds.
Last  criterion = 5.49e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Now dataset of size torch.Size([300, 492]).
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 7.5 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06

Autoregressive Pln: \(\texttt{PlnAR}\)

  • Assume dependency between successive individuals \(\implies\) time series or 1D spatial data: \[ \begin{aligned} \mathbf{Z}_{i+1} \mid \mathbf Z_i & = \boldsymbol\mu_{i+1} + \Phi (\mathbf Z_i - \boldsymbol\mu_i) + \boldsymbol\epsilon_i, \qquad \boldsymbol\epsilon_i \sim \mathcal{N}(\mathbf 0, \boldsymbol\Sigma_{\epsilon}), \\ \mathbf{Y}_{i+1} \mid \mathbf Z_{i+1} & \sim \mathcal{P}(\exp(\mathbf Z_{i+1})) \end{aligned} \]
  • Constraint: \[ \boldsymbol \Sigma_{\epsilon} = \boldsymbol\Sigma - \Phi \boldsymbol \Sigma \Phi^\top \succ 0\]
  • Marginal laws: \[\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu_i, \boldsymbol\Sigma)\]
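As a sanity check, the constraint can be verified by simulation in the easiest case (scalar \(\Phi = \phi \mathbf I\), full \(\boldsymbol\Sigma\), centered chain): running \(\mathbf Z_{i+1} = \Phi \mathbf Z_i + \boldsymbol\epsilon_i\) with innovation covariance \(\boldsymbol\Sigma_\epsilon = \boldsymbol\Sigma - \Phi\boldsymbol\Sigma\Phi^\top\) recovers \(\boldsymbol\Sigma\) as the stationary covariance. A NumPy sketch with arbitrary illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
T, phi = 100_000, 0.6

Sigma = np.array([[2.0, 0.8, 0.3],  # arbitrary positive-definite full covariance
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])
Phi = phi * np.eye(3)               # scalar autoregression: the "easy" case

Sigma_eps = Sigma - Phi @ Sigma @ Phi.T  # = (1 - phi^2) * Sigma, still PD
L_eps, L0 = np.linalg.cholesky(Sigma_eps), np.linalg.cholesky(Sigma)

Z = np.empty((T, 3))
Z[0] = L0 @ rng.normal(size=3)           # start from the stationary law
for t in range(T - 1):
    Z[t + 1] = Phi @ Z[t] + L_eps @ rng.normal(size=3)

# Empirical covariance of the chain recovers the marginal Sigma
print(np.max(np.abs(np.cov(Z.T) - Sigma)) < 0.1)  # True
```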

Autoregressive constraints

\[\boldsymbol \Sigma_{\epsilon} = \boldsymbol \Sigma - \Phi\boldsymbol \Sigma \Phi^\top \succ 0\]

  • “Easy” cases:

    • \(\boldsymbol \Sigma\) diagonal and \(\Phi\) diagonal with entries in \((0, 1)\)
    • \(\boldsymbol \Sigma\) full and \(\Phi = \phi \mathbf I\) scalar with \(\phi \in (0, 1)\)
  • Problematic case:

    • \(\boldsymbol \Sigma\) full and \(\Phi\) diagonal \(\implies\) \(\boldsymbol \Sigma_{\epsilon}\) may fail to be positive definite
  • More complex case:

    • \(\boldsymbol \Sigma\) full and \(\Phi\) full

Autoregressive constraints

\[ \begin{array}{c|cc} \Phi \ \backslash \ \Sigma & \text{Diagonal} & \text{Full} \\ \hline \text{Scalar} & \checkmark & \checkmark \\ \text{Diagonal} & \checkmark & \times \\ \text{Full} & \times & ?? \\ \end{array} \]
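The table can be probed numerically. In the short NumPy check below (arbitrary illustrative matrices), a scalar \(\Phi\) keeps \(\boldsymbol\Sigma_\epsilon\) positive definite even for a strongly correlated full \(\boldsymbol\Sigma\), while a diagonal \(\Phi\) with unequal entries breaks it:

```python
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])  # full covariance with strong correlation

def innovation_cov(Phi):
    return Sigma - Phi @ Sigma @ Phi.T

# Scalar Phi: Sigma_eps = (1 - phi^2) * Sigma, always positive definite
eig_scalar = np.linalg.eigvalsh(innovation_cov(0.95 * np.eye(2)))
print(eig_scalar.min() > 0)  # True

# Diagonal Phi with unequal entries: the constraint can fail
eig_diag = np.linalg.eigvalsh(innovation_cov(np.diag([0.1, 0.95])))
print(eig_diag.min() > 0)    # False: Sigma_eps is not positive definite
```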

Motivation: crossover dataset

  • Data: number of crossovers (exchanges of genetic material between two DNA strands) at each locus.

  • All 27 chromosomes placed end to end \(\implies\) 500 loci, for \(\textbf{4 different species of sheep}\).

  • \(\implies\) 4 time series of 500 loci, where each locus depends on the previous one.

\(\texttt{PlnAR}\) diagonal

from pyPLNmodels import load_crossover, PlnAR
cross = load_crossover()
ar = PlnAR(cross["endog"], ar_type = "diagonal").fit()
Returning crossover dataset of size (500, 4)
Setting the offsets to zero.
Fitting a PlnAR model with autoregressive type diagonal.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400)  reached in 1.4 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

Conclusion

  • \(\texttt{pyPLNmodels}\) is easy to use for many statistical tasks on count data.
  • Focus on optimization and scalability when possible.
  • Most models run out of memory for \(n, p > 3000\) on CPU.

What’s next?

  • Implement a \(\texttt{PlnAR}\) with full autoregressive structure.
  • Implement collections of \(\texttt{PlnNetwork}\) and \(\texttt{PlnMixture}\) models to directly select the best model via BIC.

References

Aitchison, J., and C. H. Ho. 1989. “The multivariate Poisson-log normal distribution”. Biometrika 76 (4): 643–53.
Chiquet, Julien, Mahendra Mariadassou, and Stéphane Robin. 2018. “Variational inference for probabilistic Poisson PCA”. The Annals of Applied Statistics 12: 2674–98. http://dx.doi.org/10.1214/18-AOAS1177.
———. 2019. “Variational inference for sparse network reconstruction from count data”. In Proceedings of the 19th International Conference on Machine Learning (ICML 2019).
———. 2021. “The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances”. Frontiers in Ecology and Evolution 9. https://doi.org/10.3389/fevo.2021.588292.
Swechha, Dylan Mendonca, Octavian Focsa, J. Javier Díaz-Mejía, and Samuel Cooper. 2021. “scMARK an MNIST like benchmark to evaluate and optimize models for unifying scRNA data”. bioRxiv. https://doi.org/10.1101/2021.12.08.471773.